Aiming off the target: studying repetitive DNA using target capture sequencing reads

Lucas Costa1, André Marques2, Chris Buddenhagen3, William Wayt Thomas4, Bruno Huettel5, Veit Schubert6, Steven Dodsworth7, Andreas Houben6, Gustavo Souza1, Andrea Pedrosa-Harand1

1 Laboratory of Plant Cytogenetics and Evolution, Department of Botany, Federal University of Pernambuco, Recife-PE, Brazil
2 Max Planck Institute for Plant Breeding Research, Cologne, Germany
3 AgResearch, Plant Functional Biology, Ruakura, New Zealand
4 New York Botanical Garden, Bronx, New York, United States of America
5 Max Planck Genome Centre Cologne, Max Planck Institute for Plant Breeding Research, Cologne, Germany
6 Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben, Seeland, Germany
7 School of Life Sciences, University of Bedfordshire, Luton, UK
bioRxiv preprint doi: https://doi.org/10.1101/2020.12.10.419515; this version posted December 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.
SUMMARY
• With the advance of high-throughput sequencing (HTS), reduced-representation methods such as target capture sequencing (TCS) emerged as cost-efficient ways of gathering genomic information. As the off-target reads from such sequencing are expected to be similar to genome skims (GS), we assessed the quality of repeat characterization using these data.
• For this, the repeat composition of TCS datasets from five Rhynchospora (Cyperaceae) species was compared with GS data from the same taxa.
• All the major repetitive DNA families were identified in TCS, including repeats that showed abundances as low as 0.01% in the GS data. Rank correlations between GS and TCS repeat abundances were moderately high (r = 0.58-0.85), increasing after filtering out the targeted loci from the raw TCS reads (r = 0.66-0.92). Repeat data obtained by TCS were also reliable enough to develop a cytogenetic probe and resolve phylogenetic relationships of Rhynchospora species with high support.
• In light of our results, TCS data can be effectively used for cyto- and phylogenomic investigations of repetitive DNA. Given the growing availability of HTS reads, driven by global phylogenomic projects, our strategy represents a way to recycle genomic data and contribute to a better characterization of plant biodiversity.

Keywords: genome skimming, holocentric, reduced-representation sequencing, RepeatExplorer, Rhynchospora, satellite DNA, transposable elements
INTRODUCTION
One intriguing aspect of eukaryotic genomes is the staggering 60,000-fold variation in DNA content among species (Elliott & Gregory, 2015). Advances in genomics have led to the discovery that most of this diversity is the result of variable amounts of repetitive DNA, commonly divided into tandem repeats and dispersed repeats (Weiss-Schneeweiss et al., 2015; Elliott & Gregory, 2015). Tandemly distributed satellite DNAs are remarkable for their fast evolution and variance in abundance and structure at all hierarchical levels (Novák et al., 2017; Ávila Robledillo et al., 2018). As for dispersed repetitive sequences, transposable elements (TE) are especially abundant in flowering plant genomes (Jurka et al., 2011; Galindo-González et al., 2017). Retrotransposons, particularly the ones possessing a long terminal repeat (LTR-retrotransposons), account for most of this abundance (Weiss-Schneeweiss et al., 2015), with two major superfamilies being recognized (Ty1/copia and Ty3/gypsy) based on the order of the protein coding domains, and further divided into a number of lineages according to phylogenetic distance (Neumann et al., 2019).
Contrasting with the previous notion that repetitive DNA was no more than “junk DNA”, cytogenomic studies in the last few decades helped to uncover possible roles for tandem and dispersed repeats. Specific satellite DNAs and retrotransposons have been found to have a functional and/or structural role in centromeres (Cheng et al., 2002; Nagaki et al., 2003; Houben & Schubert, 2003; Marques et al., 2015; Macas et al., 2015; Ribeiro et al., 2017). The role of LTR-retrotransposons in genome size variation led to investigations correlating these to heterochromatin distribution, ecological variables, community structure and plant distribution (Guignard et al., 2016; Van-Lume et al., 2017; Lyu et al., 2018; Souza et al., 2019). The activity of transposable elements in the host genome can also cause modifications to gene regulation and the formation of retrogenes, generating morphological innovations and impacting speciation processes (Schrader & Schmitz, 2019). Moreover, lineage-specific satellite DNAs have been widely used as efficient cytomolecular markers, allowing the identification of chromosome pairs and elucidating chromosome rearrangement and duplication events (Koo et al., 2011; Čížková et al., 2013; Ávila Robledillo et al., 2018; Ribeiro et al., 2020).
The fast evolution of repetitive DNA, with many satellite families being genus- or species-specific, impairs its use in phylogenetic studies, as concerted evolution and homogenisation reduce sequence variability for comparative studies across taxa (Macas et al., 2015; Mascagni et al., 2020; Ribeiro et al., 2020). Repeat abundances, however, carry phylogenetic signal, as demonstrated by a method to reconstruct phylogenetic relationships using the abundance of different repetitive elements (Dodsworth et al., 2015). This method has proven useful to elucidate relationships in different groups of plants and animals (Dodsworth et al., 2017; Bolsheva et al., 2019; Martín-Peciña et al., 2019). Other methods have assessed the usefulness of repeat-based phylogenetic analysis. More recently, it was demonstrated that sequence similarity measures of repeated sequences can also be used as characters to resolve phylogenetic relationships (Vitales et al., 2019). In a similar approach, assembly and alignment free (AAF) methods can be applied to high complexity fractions of the genome, such as repetitive DNA, in a phylogenomic framework (Fan et al., 2015; Sarmashghi et al., 2019).
In order to identify and characterize the diversity of repeats in a genome, the RepeatExplorer pipeline was created, using a graph-based clustering algorithm to group high-copy sequences based on similarity, with posterior identification of protein-coding domains (Novak et al., 2013). RepeatExplorer can identify repetitive elements with approximately 0.1× genome coverage, most commonly known as genome skimming (GS), which is often sufficient to study high-copy nuclear and organellar DNA (Straub et al., 2012; Dodsworth, 2015; Dodsworth et al., 2019). The GS method is just one of several “reduced representation” methods of high-throughput sequencing. Throughout the last decade, a number of these sequencing methods have been proposed, generating high quality sequencing data at decreasing costs, such as restriction site-associated sequencing (RAD-seq, Eaton et al. 2016) and transcriptome sequencing (RNA-seq, Wang et al. 2017).
Another widely used reduced representation method is target capture sequencing (TCS), in which several genomic probes are designed to “capture” and enrich specific low-copy coding regions of the nuclear genome (Albert et al., 2007; Gnirke et al., 2009). These probes can be designed based on conserved regions retrieved from the alignment of several genomes of a divergent set of organisms (e.g. Anchored Hybrid Enrichment, Lemmon et al., 2012) or by comparing transcriptomic data to search for a conserved set of orthologs across a group (Johnson et al., 2019). Since the development of TCS (Albert et al., 2007), many sets of probes have been developed and applied in both plants and animals (Cosart et al., 2011; Faircloth et al., 2012; Mandel et al., 2014; Ilves & López-Fernández, 2014; Sass et al., 2016; Heyduk et al., 2016; Schmickl et al., 2016). Moreover, the nature of these probe design approaches has allowed the development of universal probe sets, with the potential to be used across all of the angiosperms (e.g. Buddenhagen et al. 2016; Johnson et al. 2019). In addition to the advantage of being universal, a high recovery rate of target regions (or the number of enriched targeted loci in the final library) is often achievable with very low “enrichment efficiency” (percentage of library reads successfully mapped to a target sequence) (Johnson et al., 2019), meaning that a TCS enriched library will often contain a high number of “off-target” reads. The use of these off-target reads, in combination with the enriched reads, has been referred to as “Hyb-Seq” (hybrid sequencing, Weitemier et al., 2014). As the off-target reads are often rich in high-copy DNA, Hyb-Seq approaches have been used to reconstruct whole plastomes and ribosomal DNA (Weitemier et al., 2014; Schmickl et al., 2016).
To check whether off-target reads could also be used for identifying, and potentially quantifying, repetitive DNA, we selected the sedge Rhynchospora Vahl as a model. Rhynchospora is one of the largest genera of Cyperaceae Juss., comprising approximately 350 species, but it is under-studied from a phylogenetic point of view (Thomas et al., 2009; Buddenhagen et al., 2016). Cytologically, Rhynchospora has been of great interest due to its holocentric chromosomes, which present the centromere dispersed along the sister chromatids rather than localized in a primary constriction (Bureš et al., 2013). Moreover, cytogenomic studies on R. pubera (Vahl) Boeckeler have led to the discovery of the first centromere-specific satellite DNA reported in a holocentric organism, Tyba (Marques et al., 2015). Subsequent studies confirmed the presence of Tyba in other Rhynchospora species (Ribeiro et al., 2017). Nevertheless, other non-centromeric satellites found in Rhynchospora species showed the typical block-like pattern on localized chromosomal regions (Ribeiro et al., 2017).
Large-scale repeat analysis covering all major clades of Rhynchospora would provide valuable insights into the repeat evolution of this genus. Recently, using a set of probes developed by Anchored Hybrid Enrichment (Buddenhagen et al., 2016), a number of Rhynchospora species were sequenced using a TCS approach. Here, we assessed the quality of repeat characterization in five Rhynchospora species from this dataset. As a considerable part of the off-target reads can be sequences close to the original targeted loci (Dodsworth et al., 2019), we searched for sequencing bias by comparing the target results with results obtained from genome skimming (i.e. unenriched libraries). We specifically addressed three questions: 1) Can we characterize the repetitive DNA fraction of Rhynchospora genomes using TCS data as well as with GS data? 2) Can we develop cytological markers from TCS clustering data? 3) Can we use TCS data to reconstruct repeat-based phylogenetic trees?

MATERIALS AND METHODS

Plant material and sequence data
Individuals of Rhynchospora cephalotes (L.) Vahl and R. exaltata Kunth were collected near the towns of Jacaraú (Paraíba, Brazil, voucher UFP87625) and Jaqueira (Pernambuco, Brazil, voucher JPB51537), respectively. These individuals were cultivated in (i) the experimental garden of the Laboratory of Plant Cytogenetics and Evolution at the Universidade Federal de Pernambuco (Brazil), (ii) the greenhouse of the Max Planck Institute for Plant Breeding Research (Cologne, Germany) and (iii) the greenhouse of the Leibniz Institute of Plant Genetics and Crop Plant Research (IPK Gatersleben, Germany).
We downloaded available short read archive data of R. pubera (Vahl) Boeckeler (Marques et al., 2015, BioProject PRJEB9643) from the NCBI GenBank (www.ncbi.nlm.nih.gov). Genome skimming sequences of R. globosa (Kunth) Roem. & Schult. and R. tenuis Link were obtained from Ribeiro et al. (2017) and deposited on GenBank under BioProject PRJNA672922. TCS reads (150 bp) for R. cephalotes, R. exaltata, R. globosa, R. pubera and R. tenuis were obtained from Buddenhagen (2016) and deposited on GenBank under BioProject PRJNA672127.

DNA extraction and sequencing
Genomic DNA from R. cephalotes and R. exaltata was extracted from leaves with the NucleoBond HMW DNA kit (Macherey-Nagel, Düren, Germany). Quality was assessed with an Agilent TapeStation and the gDNA was quantified by Qubit BR assay (Thermo). An Illumina-compatible library was then prepared from 400 ng input gDNA with an NEBNext Ultra™ II FS DNA Library Prep Kit for Illumina (New England Biolabs) with a total of four PCR cycles to introduce dual indexed barcodes. Libraries were sequenced at the Max Planck Genome Centre Cologne on a HiSeq2500 system with 2 × 250 bp rapid mode using a HiSeq Rapid PE Cluster and Rapid SBS v2 kit. The new sequence data were deposited on GenBank under BioProject PRJNA672693.

Genome size measurements
The DNA content of R. cephalotes and R. exaltata was newly estimated by flow cytometry. Sample preparation was done according to Loureiro et al. (2007). Young leaves of each of the studied plants were chopped simultaneously with their respective reference standard, Solanum lycopersicum cv. Stupicke (2C = 1.96 pg, Dolezel et al., 1992) for R. cephalotes and Raphanus sativus L. cv. Saxa (2C = 1.11 pg, Dolezel et al., 1992) for R. exaltata, in a Petri dish (kept on ice) containing 2 mL of Woody Plant Buffer (WPB). The sample was then filtered through a 30-µm disposable mesh filter (CellTrics, SYSMEX, Norderstedt, Germany), followed by the addition of 50 μg/mL propidium iodide (from a stock of 1 mg/mL; Sigma-Aldrich) and 50 µg/mL RNase (Sigma-Aldrich). Nine replicates per species were made. The samples were measured in a CyFlow Space flow cytometer (SYSMEX) equipped with a green laser (532 nm). Histograms of relative fluorescence were obtained using the software Flomax v.2.3.0 (SYSMEX, Norderstedt, Germany). Mean fluorescence and coefficient of variation were assessed at half of the fluorescence peak. The absolute DNA content (pg/2C) was calculated by multiplying the ratio of the G1 peaks (sample to standard) by the genome size of the internal standard.
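The calculation described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function names and the example peak positions are hypothetical, and the factor of 978 Mbp per pg is a commonly used conversion that is not stated in the text.

```python
# Hypothetical helper names; the peak means used below are invented for illustration.
PG_TO_MBP = 978.0  # assumed conversion: 1 pg of DNA is roughly 978 Mbp


def dna_content_2c_pg(sample_g1_mean, standard_g1_mean, standard_2c_pg):
    """Absolute 2C DNA content (pg): ratio of G1 peak means times the standard's 2C value."""
    if standard_g1_mean <= 0:
        raise ValueError("standard G1 peak mean must be positive")
    return sample_g1_mean / standard_g1_mean * standard_2c_pg


def genome_size_1c_mbp(content_2c_pg):
    """Convert a 2C value in pg to a 1C genome size in Mbp."""
    return content_2c_pg / 2.0 * PG_TO_MBP


# Example with the tomato standard (2C = 1.96 pg) and made-up peak positions.
content = dna_content_2c_pg(sample_g1_mean=73.0, standard_g1_mean=196.0, standard_2c_pg=1.96)
print(f"2C = {content:.2f} pg, 1C = {genome_size_1c_mbp(content):.2f} Mbp")
```

In practice the ratio would be computed per replicate and then averaged, so that the coefficient of variation can be reported alongside the mean.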

Filtering of TCS reads
As our aim was to characterize the repeat fraction of the sequenced species, we were only interested in the off-target reads, whose abundance is inversely proportional to the efficiency of target sequence enrichment. To remove the target reads, we filtered the raw TCS data by mapping them to a set of 256 sequences representing consensus sequences of the target loci that were enriched in the Rhynchospora dataset (Buddenhagen, 2016), saving the unmapped reads for the repeat characterization. Two different mapping algorithms were used for comparison: the Geneious read mapper v6.0.3, with custom sensitivity settings (60% Maximum mismatch per read, Index word length = 12, Maximum ambiguity = 8, Kearse et al., 2012) and the BowTie2 v2.4.1 mapper with high sensitivity settings (0 to 800 insert size, report all matches, Langmead & Salzberg, 2012), both implemented in the software Geneious v.7.1.9 (Kearse et al., 2012). Additionally, we uploaded a FASTA file with the 256 target sequences to RepeatExplorer (https://repeatexplorer-elixir.cerit-sc.cz/) prior to the analysis as a Custom Repeat Database (Novak et al., 2013). With this option we could exclude from the analysis clusters of enriched gene sequences mistakenly identified as repeats. In summary, we ended with one GS dataset (comprising previously published datasets for R. globosa, R. pubera and R. tenuis and the newly sequenced R. cephalotes and R. exaltata data) and four different “target datasets” for each species: 1) Raw target capture reads; 2) Reads left after mapping with Geneious read mapper; 3) Reads left after mapping with BowTie Mapper and 4) Reads left after exclusion of “enriched gene clusters” annotated according to a custom Repeat Database (RE).
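The mapping-based filtering can be illustrated with a short sketch that separates on-target from off-target reads in SAM-formatted mapper output. It is an assumption-laden example rather than the procedure run inside Geneious: the function name and the toy records are hypothetical, and it relies only on the SAM FLAG bits 0x4 (read unmapped), 0x40 (first mate) and 0x80 (second mate).

```python
def split_on_off_target(sam_lines):
    """Return (on_target, off_target) sets of (read_name, mate) keys from SAM records."""
    on_target, off_target = set(), set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        if flag & 0x4:
            off_target.add((name, mate))  # unmapped: candidate for repeat analysis
        else:
            on_target.add((name, mate))   # mapped to a target locus
    # A read reported several times (e.g. secondary hits) counts as on-target once.
    return on_target, off_target - on_target


# Toy records (hypothetical): r1 maps twice to target loci, r2 and r3 do not map.
sam = [
    "@SQ\tSN:locus1\tLN:500",
    "r1\t0\tlocus1\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGG\tIIIII",
    "r1\t256\tlocus1\t200\t1\t5M\t*\t0\t0\tACGTA\tIIIII",
]
on_target, off_target = split_on_off_target(sam)
percent_on_target = 100.0 * len(on_target) / (len(on_target) + len(off_target))
print(f"{percent_on_target:.2f}% of reads mapped to target loci")
```

The percentage printed here corresponds to the kind of value reported per mapper as "% of filtered TCS reads", while the off-target set is what would be passed on to repeat clustering.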

In silico Repeat Analysis
In order to compare the repeat composition observed in all different datasets, we employed the RepeatExplorer pipeline (Novak et al., 2013). Reads from all datasets of R. cephalotes, R. exaltata, R. globosa, R. pubera and R. tenuis were uploaded to the platform, filtered by quality with default settings (95% of bases equal to or above the quality cut-off value of 10) and interlaced. Clustering was performed with default settings of 90% similarity over a 55% minimum sequence overlap. The Find RT Domains tool and additional database searches (GenBank) were used to identify protein domains for repeat annotation, and graph layouts of individual clusters were examined interactively using the SeqGrapheR tool (Novak et al., 2013).
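The quality criterion quoted above (at least 95% of bases at or above a quality of 10) can be expressed as a small check on a FASTQ quality string. This is a sketch only: it assumes Phred+33 encoding, the function name is hypothetical, and it is not the code used by the RepeatExplorer platform.

```python
def passes_quality_filter(quality_string, min_quality=10, min_fraction=0.95, phred_offset=33):
    """True if at least `min_fraction` of bases have Phred quality >= `min_quality`."""
    if not quality_string:
        return False  # an empty read cannot pass
    good = sum(1 for ch in quality_string if ord(ch) - phred_offset >= min_quality)
    return good / len(quality_string) >= min_fraction


# "I" encodes Q40 and "#" encodes Q2 under the assumed Phred+33 encoding.
print(passes_quality_filter("I" * 19 + "#"))      # 19 of 20 bases pass the cut-off
print(passes_quality_filter("I" * 18 + "#" * 2))  # only 18 of 20 bases pass the cut-off
```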
Although the entire set of reads from each dataset was uploaded to RepeatExplorer, we used the Read Sampling option on the clustering analysis to manually input the number of reads to be analysed, corresponding to 0.13× coverage of the genome of each species (Table 1). This was not possible only for R. globosa, as no information on genome size was available for this species. In this case, we ran tests with 500,000, 1,000,000 and 2,000,000 reads. However, regardless of the number of reads used as input, only 200,061 reads were analysed, always with similar results. Clusters with at least 0.01% genome abundance were automatically annotated and manually checked. The genome proportion of high copy repeats was estimated based on the number of reads of the annotated clusters.
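The arithmetic behind read sampling and genome proportion can be sketched as below. The function names are hypothetical, and the read length is set to the 150 bp reported for the TCS reads; the text does not state which length, or whether reads or read pairs, entered the authors' calculation, so the output is illustrative and need not match Table 1.

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed to reach a given genome coverage."""
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return round(genome_size_bp * coverage / read_length_bp)


def genome_proportion(cluster_reads, total_reads_analysed):
    """Percentage of analysed reads that fall in one cluster."""
    return 100.0 * cluster_reads / total_reads_analysed


# 1C genome size of R. tenuis from Table 1 (381.42 Mbp), 0.13x coverage, assumed 150 bp reads.
n_reads = reads_for_coverage(381_420_000, 0.13, 150)
print(n_reads)

# A hypothetical cluster of 500 reads, checked against the 0.01% annotation threshold.
proportion = genome_proportion(500, n_reads)
print(f"{proportion:.3f}% of the genome, annotated: {proportion >= 0.01}")
```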

Table 1 – Genome size (2C), number of reads from GS (genome skimming) and raw TCS datasets, percentage of mapped reads on Bowtie Mapper and Geneious Read Mapper datasets, percentage of reads identified on “target clusters” of RepeatExplorer (RE) dataset and number of reads analysed for all datasets on each species.
Species | 1C (Mbp) | Number of GS read pairs | Number of raw TCS reads (d) | % of filtered TCS reads*: BowTie | % of filtered TCS reads*: Geneious | % of filtered TCS reads*: RE | Read pairs analysed
Rhynchospora cephalotes (L.) Vahl | 356.97 | 11,737,291 | 6,932,932 | 11.77 | 23.12 | 16.80 | 600,000
Rhynchospora exaltata Kunth | 244.5 | 21,384,261 | 5,151,874 | 13.02 | 20.65 | 18.45 | 422,199
Rhynchospora globosa (Kunth) Roem. & Schult. | - | 4,000,000 (c) | 3,713,422 | 4.10 | 14.95 | 8.82 | 200,061
Rhynchospora pubera (Vahl) Boeckeler | 1,613.7 (a) | 4,000,000 (a) | 4,164,982 | 3.96 | 12.29 | 8.66 | 2,797,919
Rhynchospora tenuis Link | 381.42 (b) | 4,000,000 (c) | 4,826,084 | 10.18 | 20.64 | 6.36 | 661,128
(a) Marques et al., 2015; (b) Ribeiro et al., 2018; (c) Ribeiro et al., 2017; (d) Buddenhagen 2016
* % of target reads identified by the different strategies
We used the TAREAN tool (Novák et al., 2017) available in the RepeatExplorer pipeline to annotate satellite DNAs. Satellites were named based on previous publications or by using the species abbreviation followed by SAT, a number based on the decreasing order of abundance and a hyphen followed by the number of basepairs of the monomer. The consensus monomer sequences of the identified satellite DNAs of each species were compared using DOTTER (Sonnhammer & Durbin, 1995).

Testing Method Performance
To assess whether the abundances of the different repeats in our raw and filtered target capture datasets were comparable to those in the GS datasets, we used the abundance values of the individual repeat families (Table S1). As the data did not follow a normal distribution, we used Spearman’s rank correlation. The analysis was undertaken with the package stats implemented in the software R v4.0.2 (R Core Team, 2019). Correlation plots were constructed with the R package ggplot2 (Wickham, 2016).
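For readers unfamiliar with the statistic, the following sketch computes Spearman's rank correlation in plain Python (the analysis itself was run in R, as stated above). The function names are hypothetical. The example vectors are the ten classified repeat-type values listed for the R. cephalotes GS and raw TCS datasets in Table 2, used only as stand-in input; the published coefficients were computed from the family-level values in Table S1, so the result here is not comparable with them.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Stand-in input: R. cephalotes repeat-type abundances (%) from Table 2, GS vs raw TCS.
gs = [0.06, 0.40, 3.76, 0.66, 0.69, 0.95, 0.29, 0, 0.58, 0.03]
raw = [0.11, 0.21, 2.84, 0.51, 0.16, 0.45, 0.05, 0.06, 0.24, 0]
print(f"rho = {spearman_rho(gs, raw):.2f}")
```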

Repeat amplification, probe labelling and in situ hybridization
Primers for the R. cephalotes Tyba variant found in the TCS datasets (see results) were designed based on the most conserved region of the consensus sequence (F: 5’-AAGCTATTTGAATGCAATTATGTGC; and R: 5’-AGCGTTTCTAGCCACATTTGA). Genomic DNA (40 ng) of R. cephalotes was used for the PCR reaction with 1× PCR buffer, 2 mM MgCl2, 0.1 mM of each dNTP, 0.4 μM of each primer, 0.025 U Taq polymerase (Qiagen) and water. The PCR conditions were as follows: 94°C for 2 min, 30 cycles of 94°C for 50 s, 58°C for 50 s and 72°C for 1 min, and a final step of 72°C for 10 min. PCR products were labelled with Atto488-dUTP (Jena Bioscience) with a nick translation labelling kit (Jena Bioscience).
Mitotic chromosomes of R. cephalotes were prepared from root tips, pre-treated in 2 mM 8-hydroxyquinoline at 10°C for 20 h, fixed in ethanol:acetic acid (3:1 v/v) for 2 h at room temperature and stored at −20°C. Fixed root tips were digested with 2% cellulase, 2% pectinase and 2% pectolyase in citrate buffer (0.01 M sodium citrate and 0.01 M citric acid) for 120 min at 37°C and squashed in a drop of 45% acetic acid. Fluorescent in situ hybridization was performed as described by Aliyeva-Schnorr et al. (2015). The hybridization mixture contained 50% (v/v) formamide, 10% (w/v) dextran sulfate, 2×SSC, and 5 ng/μl of the probe. Slides were denatured at 75°C for 5 min, and the final stringency of hybridization was 76%.

Immuno-FISH
To visualize the centromeres of R. cephalotes, we performed immunostaining of the centromere-specific histone variant CENH3 with polyclonal antibodies developed for R. pubera (RpCENH3, Marques et al., 2015). Mitotic preparations were made from root meristems fixed in 4% paraformaldehyde in Tris buffer (10 mM Tris, 10 mM EDTA, 100 mM NaCl, 0.1% Triton, pH 7.5) for 5 minutes on ice under vacuum and for another 25 minutes only on ice. After washing twice in Tris buffer, the roots were chopped in LB01 lysis buffer (15 mM Tris, 2 mM Na2EDTA, 0.5 mM spermine 4HCl, 80 mM KCl, 20 mM NaCl, 15 mM β-mercaptoethanol, 0.1% Triton X-100, pH 7.5), filtered through a 50 μm filter (CellTrics, Sysmex), diluted 1:10, and subsequently, 100 μl of the diluted suspension were centrifuged onto microscopic slides using a Cytospin3 (Shandon, Germany) as described by Jasencakova et al. (2001). Immuno-FISH with anti-RpCENH3 antibodies and the Tyba repeat was performed according to Ishii et al. (2015). We used rabbit anti-RpCENH3 (diluted 1:200) as primary antibody and detected it with Cy3-conjugated anti-rabbit IgG (Dianova) secondary antibody (diluted 1:200). Slides were incubated overnight at 4°C and washed three times in 1×PBS before the secondary antibody was applied.

Microscopy
For widefield microscopy, we used an epifluorescence microscope BX61 (Olympus) equipped with a cooled CCD camera (Orca ER, Hamamatsu). To achieve super-resolution of ~120 nm (with a 488 nm laser excitation), we applied spatial structured illumination microscopy (3D-SIM) using a 63×/1.40 Oil Plan-Apochromat objective of an Elyra PS.1 microscope system and the software ZENBlack from Carl Zeiss GmbH (Weisshart et al., 2016).

Comparative repeat phylogenomics
We employed a repeat abundance-based phylogenetic inference method (see details in Dodsworth et al., 2015) to assess whether repeat abundances identified in our TCS reads could be used to resolve phylogenetic relationships, using one of our filtered datasets (BowTie) and the GS dataset for comparison. First, we concatenated reads for our five species with 0.065× coverage, with species-specific codes for each set of reads, and ran a comparative clustering analysis (simultaneous clustering of all species in the dataset) on RepeatExplorer with default settings (Novak et al., 2013). As the sequences were coded with the species names, we could identify the number of reads that each species contributed to each of the generated clusters, which is proportional to the genomic abundance of the repeat in that species. Parsimony analysis using repeat abundances as quantitative characters was undertaken as described by Dodsworth et al. (2015).
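The step of counting how many reads each species contributes to each cluster can be sketched as follows. This is a hedged illustration: the species codes ("Rc", "Rp"), read names and cluster labels are hypothetical, the input structure does not reproduce RepeatExplorer's output format, and the subsequent parsimony analysis is not shown.

```python
from collections import Counter


def abundance_matrix(cluster_to_reads, species_codes):
    """Map each cluster to per-species read counts, ordered as in `species_codes`."""
    matrix = {}
    for cluster, read_names in cluster_to_reads.items():
        counts = Counter()
        for name in read_names:
            for code in species_codes:
                if name.startswith(code):  # read names carry a species-specific prefix
                    counts[code] += 1
                    break
        matrix[cluster] = [counts[code] for code in species_codes]
    return matrix


# Hypothetical comparative clustering result for two species codes.
clusters = {
    "CL1": ["Rc_0001", "Rc_0002", "Rp_0001"],
    "CL2": ["Rp_0002"],
}
matrix = abundance_matrix(clusters, species_codes=["Rc", "Rp"])
for cluster, row in matrix.items():
    print(cluster, row)
```

Each row of such a matrix corresponds to one character (cluster) and each column to one taxon, which is the shape of input that an abundance-based analysis would start from.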
To assess the phylogenetic potential of repetitive elements based on sequence similarity, we used the alignment and assembly free (AAF) approach (Fan et al., 2015) using all reads identified as repeats by RepeatExplorer in the Bowtie dataset. AAF constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. It also calculates the statistical properties of the pairwise distances between genomes, allowing it to optimize parameter selection and to perform bootstrapping.
In order to compare our repeat abundance-based phylogeny with a nuclear marker-based phylogeny, we extracted the aligned sequences of 256 loci gathered by target capture (Buddenhagen, 2016) of our five Rhynchospora species. For simplification, we used the most general model of DNA substitution, GTR + I + G (Abadi et al., 2019). Phylogenetic relationships were inferred using Bayesian Inference (BI) as implemented in BEAST v.1.8.3 (Drummond & Rambaut, 2007). Two independent runs with four Markov Chain Monte Carlo (MCMC) chains were conducted, sampling every 1,000 generations for 10,000,000 generations. Each run was evaluated in TRACER v.1.7 (Rambaut et al., 2018) to assess MCMC convergence and a burn-in of 25% was applied. We then obtained the consensus phylogeny and clade posterior probabilities with the “sumt” command.

RESULTS

Efficiency of target sequences filtering
We generated GS data and genome size estimates for Rhynchospora cephalotes and R. exaltata to complement the existing data for the other three Rhynchospora species analysed here, allowing repeat composition to be compared between TCS and GS data (Table 1). The Geneious datasets presented a higher proportion of filtered reads than the BowTie datasets (Table 1). R. pubera presented the smallest proportion of filtered reads among all five species, with both the BowTie and Geneious filters (3.96% and 12.29% respectively). R. cephalotes had the highest proportion of filtered reads among the Geneious datasets, while R. exaltata presented the largest proportion among the BowTie datasets. Using the Custom Repeat Database option of RepeatExplorer, the highest amount of “target clusters” was found in R. exaltata (18.45%) and the lowest amount was found in R. tenuis (6.36%).
Repetitive DNA content of different datasets 356
To evaluate the quality of repeat characterization from TCS datasets, the genomic
proportion of different repetitive element lineages was compared with those observed in the
GS datasets (Table 2, Fig. 1). Generally, the total repeat fraction observed in
the GS datasets was smaller than that observed in any of the TCS datasets (Table 2, Fig.
S1). The raw datasets presented the highest values for total repetitive fraction in almost all
species, with the exception of R. globosa, in which the Geneious dataset showed a total
repeat proportion of 49.41% against 46.74% in the raw dataset. This is also reflected in the
number of clusters representing at least 0.01% of total genomic proportion formed
in the clustering analysis. While the GS datasets presented cluster numbers varying from 155
to 294, filtered TCS datasets ranged from 462 to 589 clusters and raw TCS datasets ranged
from 628 to 751. These differences are mostly due to the discrepancy in the proportion of
unclassified repetitive elements in the different datasets (Table 2, Fig. S1). The unclassified repeat
proportion in GS varied from 6.21% in R. exaltata to 14.70% in R. globosa, while in TCS it
accounted for up to 43.63% (raw TCS of R. exaltata). Overall, the proportion of unclassified
repeats was smaller in all filtered datasets when compared to raw TCS (Fig. 2a). Additional
mapping to the original 256 target regions and separate BLAST searches did not produce any
matches for the unclassified clusters, suggesting that these could be either repetitive or
non-repetitive accidentally enriched genomic regions, too fragmented to be annotated.
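The cluster counts reported above depend on the 0.01% threshold. As a minimal sketch, assuming that the genomic proportion of a cluster is estimated as its share of all analysed reads, the number of clusters above the threshold and their summed proportion could be computed as follows. The function name and the read counts are hypothetical and are not taken from the RepeatExplorer output of this study.

```python
def summarize_clusters(cluster_sizes, total_reads, min_percent=0.01):
    """Count clusters at or above min_percent and sum their genomic proportions.

    cluster_sizes: number of reads in each cluster (hypothetical input format).
    total_reads:   number of reads analysed in the clustering run.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Genomic proportion (%) of each cluster, assumed to be its share of reads.
    percents = [100.0 * size / total_reads for size in cluster_sizes]
    kept = [p for p in percents if p >= min_percent]
    return len(kept), sum(kept)


# Hypothetical example: 1,000,000 analysed reads and five clusters.
sizes = [52_000, 8_100, 950, 150, 40]
n_clusters, repeat_percent = summarize_clusters(sizes, total_reads=1_000_000)
print(n_clusters, round(repeat_percent, 2))
```

The smallest cluster (40 reads, 0.004%) falls below the threshold and is excluded from both the count and the sum.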

Table 2 – Genomic abundance (%) of repeat types for all datasets in the five analysed Rhynchospora species
Species | Dataset | Total repeat fraction | Unclassified repeats | 5S rDNA | 35S rDNA | Satellites | Class I/Unc. LTRs | Class I/LTR Ty1/copia | Class I/LTR Ty3/gypsy | Class I/Pararetrovirus | Class I/LINE | Class II/Subclass 1 | Class II/Subclass 2
R. cephalotes | GS | 14.81 | 7.39 | 0.06 | 0.40 | 3.76 | 0.66 | 0.69 | 0.95 | 0.29 | 0 | 0.58 | 0.03
R. cephalotes | Raw | 43.08 | 38.45 | 0.11 | 0.21 | 2.84 | 0.51 | 0.16 | 0.45 | 0.05 | 0.06 | 0.24 | 0
R. cephalotes | Geneious | 33.38 | 28.54 | 0.15 | 0.33 | 2.58 | 0.77 | 0.29 | 0.32 | 0.08 | 0 | 0.32 | 0
R. cephalotes | BowTie | 34.75 | 30.29 | 0.13 | 0.30 | 2.52 | 0.37 | 0.27 | 0.47 | 0.13 | 0.01 | 0.26 | 0
R. cephalotes | RE | 31.17 | 25.92 | 0.13 | 0.25 | 3.44 | 0.61 | 0.19 | 0.29 | 0.05 | 0 | 0.29 | 0
R. globosa | GS | 38.91 | 14.70 | 0.04 | 0.45 | 4.19 | 3.25 | 1.93 | 12.51 | 0.16 | 0.01 | 1.67 | 0
R. globosa | Raw | 46.74 | 27.05 | 0.11 | 0.76 | 1.63 | 0.84 | 2.68 | 12.12 | 0.17 | 0.12 | 1.26 | 0
R. globosa | Geneious | 49.42 | 25.20 | 0.10 | 0.79 | 1.86 | 1.08 | 2.96 | 15.50 | 0.20 | 0.13 | 1.60 | 0
R. globosa | BowTie | 45.59 | 23.84 | 0.12 | 0.69 | 1.74 | 0.39 | 2.89 | 14.25 | 0.18 | 0.11 | 1.38 | 0
R. globosa | RE | 42.13 | 20.52 | 0.12 | 0.84 | 1.79 | 0.93 | 2.94 | 13.29 | 0.19 | 0.13 | 1.38 | 0
R. tenuis | GS | 17.99 | 8.23 | 0.02 | 0.38 | 2.61 | 0.37 | 5.64 | 0.47 | 0 | 0.03 | 0.24 | 0
R. tenuis | Raw | 27.35 | 19.53 | 0.08 | 0.58 | 3.06 | 0.54 | 1.96 | 0.74 | 0.04 | 0.08 | 0.74 | 0
R. tenuis | Geneious | 24.41 | 15.59 | 0.05 | 0.68 | 3.45 | 0.71 | 2.22 | 0.77 | 0.04 | 0.07 | 0.83 | 0
R. tenuis | BowTie | 23.79 | 15.36 | 0.08 | 0.69 | 3.19 | 0.66 | 2.13 | 0.71 | 0.04 | 0.05 | 0.88 | 0
R. tenuis | RE | 22.37 | 14.13 | 0.08 | 0.62 | 3.28 | 0.54 | 2.00 | 0.80 | 0.04 | 0.09 | 0.79 | 0
R. exaltata | GS | 27.25 | 6.21 | 0.10 | 2.14 | 16.10 | 0 | 2.14 | 0.44 | 0 | 0.07 | 0.05 | 0
R. exaltata | Raw | 49.87 | 43.63 | 0.10 | 1.10 | 0.91 | 0.57 | 1.43 | 1.72 | 0 | 0.13 | 0.28 | 0
R. exaltata | Geneious | 44.28 | 33.53 | 0.16 | 1.31 | 3.69 | 0.67 | 2.01 | 2.13 | 0.02 | 0.11 | 0.65 | 0
R. exaltata | BowTie | 43.67 | 33.96 | 0.11 | 1.04 | 3.39 | 0.60 | 2.24 | 1.63 | 0 | 0.09 | 0.61 | 0
R. exaltata | RE | 41.58 | 31.53 | 0.12 | 1.20 | 3.48 | 0.65 | 1.94 | 1.99 | 0.01 | 0.15 | 0.51 | 0
R. pubera | GS | 26.67 | 9.87 | 0 | 0.61 | 3.87 | 2.12 | 6.78 | 0.85 | 0.46 | 0.22 | 1.89 | 0
R. pubera | Raw | 39.11 | 23.48 | 0 | 1.82 | 1.04 | 0.48 | 6.74 | 4.08 | 0.46 | 0.51 | 0.43 | 0.07
R. pubera | Geneious | 39.11 | 19.09 | 0 | 1.99 | 1.17 | 0.67 | 7.53 | 6.51 | 0.55 | 0.59 | 1.01 | 0
R. pubera | BowTie | 36.92 | 20.44 | 0 | 2.04 | 0.91 | 0.31 | 7.26 | 3.81 | 0.50 | 0.55 | 1.1 | 0
R. pubera | RE | 38.31 | 18.66 | 0 | 2.82 | 0.33 | 0.51 | 7.23 | 6.62 | 0.49 | 0.54 | 1.04 | 0.07

Figure 1 – Barplots representing genomic abundance of classified repeat types identified in every dataset of
Rhynchospora cephalotes (a), R. exaltata (b), R. globosa (c), R. pubera (d) and R. tenuis (e). Bar colours represent
different repeat types according to the caption at the lower right corner.

Figure 2 – Comparison of genomic proportions of unclassified elements (a) and correlation index (r) with genome
skimming data (b) between raw and filtered Target Capture datasets (RepeatExplorer, BowTie and Geneious) of
all Rhynchospora species.

The filtering strategies did not largely impact satellite DNA abundance in the analysed
species, with the exception of R. exaltata, in which it was possible to identify ~4× more satellite
reads in the BowTie and Geneious datasets than in the raw datasets. Also in R. exaltata, there
was a large discrepancy between the satellite abundance observed in the GS dataset and that
observed in all TCS datasets (Fig. 1). However, the two satellite DNAs responsible for this
abundance difference were also found in all TCS datasets (Table S2, Fig. S2), although in
smaller proportions. The amount of satellite DNA was generally lower in the TCS datasets,
with the exception of R. tenuis, in which all TCS datasets presented a small increase in satellite
abundance when compared to GS (Table 2, Fig. 1). Although in some species some of the
satellites found in GS data were not present in every dataset, the most abundant satellite of each species
could be identified in all of the TCS datasets. Similarly, some satellites found in TCS datasets
were not found in the GS datasets, possibly being low-abundance satellites accidentally
enriched (Table S2, Fig. S2).
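The fold differences mentioned for R. exaltata can be checked directly from the satellite column of Table 2. The short sketch below does this arithmetic; the function name is hypothetical, and the ratios are ratios of genomic proportions, used here as a stand-in for ratios of satellite read counts.

```python
# Satellite abundances (% of genome) for R. exaltata, copied from Table 2.
satellites = {"GS": 16.10, "Raw": 0.91, "Geneious": 3.69, "BowTie": 3.39, "RE": 3.48}


def fold_change(values, reference):
    """Ratio of every dataset's abundance to that of the reference dataset."""
    ref = values[reference]
    if ref == 0:
        raise ValueError("reference abundance is zero")
    return {name: value / ref for name, value in values.items() if name != reference}


vs_raw = fold_change(satellites, "Raw")
for name, ratio in vs_raw.items():
    print(f"{name}: {ratio:.1f}x the raw TCS abundance")
```

With these values, the Geneious and BowTie ratios to the raw dataset are both close to 4, while the GS value is more than 17 times the raw TCS value.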
For mobile elements, there was a general agreement in the order of abundance of repeat
types found in the GS and TCS datasets, with filtered datasets showing an increase in the
proportion of annotated elements when compared to the raw TCS dataset (Table 2, Fig. 1), with
a few exceptions. For example, in R. exaltata, LTRs from the Ty3/gypsy superfamily were
more abundant in the raw TCS, BowTie and RE datasets than in the GS datasets (Table 2, Fig.
1). We also compared the abundances at lineage level (Table S1). Patterns of abundance of
LTR families from GS and TCS datasets were similar, with most of the genomic abundance of
Ty1/copia and Ty3/gypsy superfamilies resulting from the amplification of up to four main
lineages (Table S1). Generally, LTRs found in GS with abundance as low as 0.01% could also
be identified in the target datasets. Surprisingly, in all five species, a greater diversity of LTR
retroelements was observed in the target capture datasets when compared to GS (Table S1).
This led to a few interesting discrepancies, such as in R. pubera, where target capture datasets
showed high abundance of Ty3/gypsy/Retand (1 to 1.3%) and Ty3/gypsy/Tekay elements (0.46
to 0.50%), despite those not being found in GS data.
To statistically compare the results obtained in the different datasets, we checked for a
correlation between the classified repeat abundances observed (at lineage level, when possible)
in all target capture datasets and the ones observed in the GS datasets (Fig. 3). Although all
tests showed significant correlations (p < 0.05), the strength of the correlation varied depending
on the dataset. Raw TCS datasets had the weakest correlation in all five species, with filtering
of the targeted sequences generally improving the correlation with the GS dataset (Fig. 2b).
The strongest correlations for R. globosa (r = 0.92), R. pubera (r = 0.74) and R. tenuis (r =
0.79) were observed with the Geneious dataset, while for R. cephalotes and R. exaltata the best
correlation was observed with the BowTie dataset (r = 0.78 and 0.75, respectively). The RE filtering for R.
globosa and R. tenuis did not improve the correlation when compared to the raw TCS dataset
(r = 0.85 and r = 0.66 respectively). However, it also showed as strong a correlation as BowTie
for R. cephalotes (r = 0.78) and as Geneious for R. pubera (r = 0.74).
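The rank correlation used for this comparison can be illustrated with a small, dependency-free sketch. It computes Spearman's coefficient as the Pearson correlation of average ranks. The input values are the class-level abundances of R. cephalotes (GS and BowTie rows of Table 2, classified repeats only), used purely as an illustration: the coefficients reported above were calculated at lineage level where possible, so this example is not expected to reproduce them. All function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Classified repeat classes of R. cephalotes from Table 2 (5S rDNA ... Class II/Subclass 2).
gs = [0.06, 0.40, 3.76, 0.66, 0.69, 0.95, 0.29, 0, 0.58, 0.03]
bowtie = [0.13, 0.30, 2.52, 0.37, 0.27, 0.47, 0.13, 0.01, 0.26, 0]
rho = spearman(gs, bowtie)
print(f"Spearman r (class level, illustration only) = {rho:.2f}")
```

Because the coefficient depends only on ranks, two datasets can correlate strongly even when their absolute proportions differ, which is the situation described for GS and filtered TCS data.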

Figure 3 – Correlation between repeat abundances observed on genome skimming and target capture datasets of
Rhynchospora species. Genome skimming abundance values are represented on the x axis of each plot, while
target dataset abundances are on the y axis. Spearman’s correlation index (r) for each case is plotted in the lower
right corner.

Chromosomal localization of the satellite DNA found in the TCS dataset
To test whether it was possible to use the TCS data to investigate the chromosomal
repeat distribution by FISH, we chose the most abundant repetitive element found in the R.
cephalotes TCS datasets. This repeat was a 172-bp satellite DNA with 60% sequence similarity
to Tyba of R. pubera (Marques et al., 2015). The in situ hybridization pattern of the R.
cephalotes Tyba (RcTyba) variant was similar to the distribution reported in R. pubera. Small
foci appeared in interphase nuclei, and a continuous line along both condensed chromatids
occurred on all pro- and metaphase chromosomes. Immuno-FISH with a CENH3-specific
antibody and RcTyba repeats confirmed the holocentric centromere structure of R.
cephalotes through the co-localization of CENH3 and RcTyba (Fig. 4).

Figure 4 – Co-localization of R. cephalotes Tyba repeats and CENH3 in interphase nuclei (a), prometaphase (b)
and metaphase chromosomes (c) via 3D-SIM imaging. Both Tyba and CENH3 clearly indicate the presence of
holocentromeres in the condensed mitotic chromosomes.

Repeat abundance and structure found in TCS data reflect phylogenetic relationships
We used repeat abundances obtained by comparative clustering analysis of BowTie
datasets of our five Rhynchospora species to reconstruct phylogenetic relationships. In the
comparative clustering analysis of the GS dataset, 1,754,326 concatenated reads were analysed,
forming 450 clusters with at least 0.01% genomic abundance. For the BowTie dataset,
1,186,605 concatenated reads were analysed, with 582 clusters representing at least 0.01%
of total genomic abundance. Repeat composition varied among species, with the largest
clusters of each species being almost or completely absent in the others (Fig. 5a). By using the
150 most abundant clusters of the comparative analysis, we were able to reconstruct the
phylogenetic relationships among the five Rhynchospora species for both the BowTie (Fig. 5c)
and GS datasets (Fig. 5b) with high bootstrap support (mean BS = 100 and 98.3, respectively).
Using the reads from all clusters identified in the BowTie dataset comparative analysis, the
AAF analysis yielded the same relationships with high bootstrap support (mean BS = 100, Fig.
5d). Branch lengths of the abundance-based analysis were significantly longer than those of the AAF
analysis. Despite this, the species relationships observed in the repeat-based phylogenies were
congruent with the ones retrieved in the Bayesian analysis with 256 concatenated target loci,
with R. cephalotes + R. exaltata forming a clade sister to R. pubera + R. tenuis, and R. globosa
sister to both clades (Fig. 5e).
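As an illustration of how cluster abundances from a comparative analysis can be turned into characters for an abundance-based phylogeny, the sketch below keeps the most abundant clusters and rescales all values by one common factor. It is a minimal sketch under stated assumptions: the read counts and species labels are hypothetical, the function names are invented for this example, and the upper bound of 65 is an assumed limit for continuous characters rather than a value stated in this section.

```python
def top_clusters(matrix, n):
    """Keep the n clusters with the highest summed abundance across species.

    matrix maps a species name to a list of per-cluster read counts;
    all lists are assumed to have the same length and cluster order.
    """
    n_clusters = len(next(iter(matrix.values())))
    totals = [sum(row[i] for row in matrix.values()) for i in range(n_clusters)]
    keep = sorted(range(n_clusters), key=lambda i: totals[i], reverse=True)[:n]
    keep.sort()  # restore the original cluster order
    return {species: [row[i] for i in keep] for species, row in matrix.items()}


def scale_abundances(matrix, max_state=65.0):
    """Divide every value by one common factor so the largest equals max_state."""
    largest = max(max(row) for row in matrix.values())
    if largest <= 0:
        raise ValueError("matrix contains no positive abundance")
    factor = largest / max_state
    return {species: [v / factor for v in row] for species, row in matrix.items()}


# Hypothetical read counts for four clusters in three species.
counts = {
    "sp_A": [1200, 30, 0, 400],
    "sp_B": [10, 900, 5, 380],
    "sp_C": [0, 20, 2600, 410],
}
top3 = top_clusters(counts, 3)
scaled = scale_abundances(top3)
for species, row in scaled.items():
    print(species, [round(v, 3) for v in row])
```

Using a single scaling factor for the whole matrix preserves the relative abundances between species and between clusters, which is the information an abundance-based analysis relies on.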

Figure 5 – Phylogenomics of Rhynchospora species using genome skimming and target capture datasets. (a) Graphic
representation of the 30 most abundant clusters originated from the comparative clustering analysis with the
BowTie dataset. Height of rectangles represents the genomic abundance of each cluster, with colours indicating
the repeat type according to the caption. Asterisks (*) indicate the clusters that represent Tyba, which is absent in
Rhynchospora globosa. Below, phylogenetic trees obtained by repeat abundance of the GS (b) and BowTie (c)
datasets, AAF of total reads in repetitive clusters (d) and Bayesian inference based on 256 target regions (e). Star
in b represents the only node with support below 100 (BS = 99).

DISCUSSION

TCS data can be used to characterize the repeat composition of a genome
We were able to find most of the repeat diversity of five Rhynchospora species using
filtered and unfiltered TCS reads. Depending on the filtering strategy employed, the
Rhynchospora data used here showed low abundance of on-target reads, indicating a
considerable proportion of off-target reads suitable for repeat analysis. Target-based
sequencing approaches often have varied enrichment efficiency, with on-target enriched reads
representing as little as 5% of the final genomic library (Johnson et al., 2019). This is
particularly the case with universal kits that are designed to work across large taxonomic
groups, at the cost of enrichment efficiency. Thus, the off-target sequence reads from such
libraries can be used in a similar fashion to genome skimming sequencing, an approach often
described as Hyb-Seq (Weitemier et al., 2014; Dodsworth et al., 2019).
Our results show that there is a moderate rank correlation between the abundances of
annotated repeats obtained by analysing genome skimming and raw target capture datasets, and
that this correlation increases when analysing off-target reads only. The significant correlation
indexes showed that although annotated repeat proportions of our filtered target datasets are
not identical to those observed in GS, repeat composition and order of abundance were highly
similar. Although the Geneious dataset presented the highest proportion of filtered reads and
highest correlation index with GS data for three of the five species, the BowTie and RE
approaches were also sufficient for increasing the correlation index in most cases. Therefore,
any of the filtering strategies presented here can produce sufficiently accurate results with regard
to repeat identification and order of abundance when compared with the traditional GS
approach. It is important to note that, although the patterns of repeat abundance are highly
similar, the numerical difference observed for some repeat classes shows that combining GS
and TCS data may produce inconsistent results, and exact quantification from these data should
be taken with caution.
Although the total abundance of satellite DNAs varied between GS and TCS datasets,
we were able to identify the most abundant satellite DNA families of all five Rhynchospora
species in all target datasets. While some low-abundance satellites (
abundance) found in GS data were not found in the target datasets, others with abundances as
low as 0.09% were found in all datasets, suggesting that the abundance of a satellite does not
necessarily determine its presence in the off-target reads (Table S2). More importantly, various
satellites found in the TCS datasets were not found in the GS dataset (Table S2), probably being
low-abundance satellites accidentally enriched in the sequencing process. Low-abundance
satellites may sometimes not be detected by RepeatExplorer, requiring additional filtering to
be detected (Ruiz-Ruano et al., 2016), which could explain the absence of these satellites in
our GS datasets.
One of the benefits of using Rhynchospora as a model for this study was the fact that it
possesses satellite DNAs with varying chromosomal distributions. Tyba clusters are dispersed,
forming a linear distribution along the metaphase holocentromeres of R. pubera, R. ciliata, R.
cephalotes and R. tenuis (Marques et al., 2015; Ribeiro et al., 2017; this study). However, other
satellites in Rhynchospora, such as RgSAT1-186 in R. globosa, form localized blocks (Ribeiro
et al., 2017). Off-target sequences used in Hyb-Seq are often adjacent to the targeted gene
regions (Weitemier et al., 2014; Dodsworth et al., 2019), raising the possibility that our analysis
would preferentially identify widespread repeats such as Tyba, which are interspersed with
genic regions (Marques et al., 2015). However, RgSAT1-186, identified as sub-terminal
clusters in the chromosomes of R. globosa (Ribeiro et al., 2017), was identified as the most
abundant cluster in all the R. globosa TCS datasets, showing that the chromosomal distribution
of a satellite DNA was not interfering with the randomness of the off-target reads (Fig. S1).
As well as confirming the presence of Tyba in the target datasets of R. pubera and R. tenuis, we
were able to find novel Tyba variants in R. cephalotes and in R. exaltata, with 60% and 57%
sequence similarity to RpTyba, respectively. R. globosa did not present any Tyba variant, in
concordance with the results of Ribeiro et al. (2017). The abundance of Tyba in R. pubera was
very low in the target datasets (~0.14%) compared to the GS results (~2.8% here, 3.6% in
Marques et al. 2015). In our R. pubera TCS datasets, the most abundant tandem repeat was
RpSAT5-287, which appeared as only the fifth most abundant in the GS dataset (Table S2).
This satellite was not found in the previous R. pubera characterization and may indicate the
potential to discover additional low-abundance sequences using TCS datasets. Although fast
rates of evolution for satellite DNAs may lead to intraspecific abundance variation (Ceccarelli
et al., 2011), this 20-fold difference is probably too high to be a product of differences between
the individuals used for each sequencing method. Satellite abundances estimated by TAREAN
depend on several factors, such as sequence coverage, monomer homogeneity and similarity
with other repeats (Novák et al., 2017). As abundance information gathered by short reads can
grossly underestimate the true abundance of this repeat type (e.g. Ribeiro et al., 2020), caution
is needed when interpreting TCS-yielded genomic abundances, though similar caution is
needed with GS data as well.
Transposable elements, particularly LTR-retrotransposons, are often the largest fraction
of repetitive DNA in plants (Galindo-González et al., 2017). In our GS datasets,
LTR-Ty1/copia was the most abundant transposable element type in three out of five species. The TCS
datasets yielded similar results to GS, with a few discrepancies, especially in R. exaltata
(predominance of Ty3/gypsy instead of Ty1/copia) and R. pubera (significantly higher
proportion of Ty3/gypsy than in GS). In order to do a more detailed comparison, we used the
individual lineage abundance values for our correlation analysis. The high correlation indexes
indicate that different LTR lineages, although varying in abundance, contributed similarly to
the repetitive fraction in GS and filtered target datasets. Our results show that in addition to
being able to find the majority of amplified lineages of LTR retrotransposons, we could also
find lineages with abundances as low as 0.01% in the target datasets. We also found a higher
diversity of LTR lineages in the target datasets than in the GS data, with a few of these “extra”
lineages being over-abundant when compared to the GS dataset (Table S1). These lineages
were absent in the GS results probably due to masking by the highly abundant satellite DNA
clusters, which were underestimated in some of our TCS datasets. Nonetheless, the fact that
we could identify even low-abundance LTRs in the target datasets, coupled with the moderate
correlation with the GS-yielded abundances, indicates that off-target reads from target
sequencing may be sufficient to identify most of the LTR retrotransposons in a genome.

TCS data can be used to develop cytogenetic probes
We tested whether we could use repeat information obtained from TCS data to develop probes
for cytogenetic techniques such as fluorescent in situ hybridization (FISH). Highly abundant
repetitive elements are frequently chosen for FISH experiments, as they are easier to visualize
on chromosomes than low-abundance repeats and can often be important components of
predominantly heterochromatic and centromeric regions (Marques et al., 2015; Bilinski et al.,
2017; Ávila Robledillo et al., 2018). In our Rhynchospora species, the most abundant cluster
in all datasets was a satellite DNA. For R. tenuis and R. globosa we found the same satellites
found previously by Ribeiro et al. (2017) as the most abundant satellites (Tyba and
RgSAT1-186, respectively).
Similar to previous results on other Rhynchospora (Marques et al., 2015; Ribeiro et al.,
2017), the Tyba variant from R. cephalotes presented a dispersed distribution in interphase
nuclei and a line-like pattern along the sister chromatids of condensed chromosomes.
Co-localization with CENH3 further confirmed the holocentromere-specific localization of Tyba.
The conservation of the holocentromeric distribution of RcTyba in R. cephalotes strengthens
its putative role in centromere function as proposed previously (Marques et al., 2015; Ribeiro
et al., 2017). It also points to a remarkably old origin for this satellite and its holocentromeric
association, dating back to a common ancestor 35–46 My ago (95% credible interval;
Buddenhagen, 2016). A large-scale survey of Tyba in the entire Rhynchospora genus could
help to further elucidate the evolution of this satellite and its association with the
holocentromere. In light of our results, we believe that using off-target reads from target
capture-based phylogenetic studies could be an additional way to find potentially informative
cytological markers across a broader sampling and investigate their evolution in a broader
context.

TCS data can be used to construct repeat-based phylogenies
In order to test if repeat abundances observed in target datasets are accurate enough to
reconstruct phylogenetic relationships, we applied the methodology described by Dodsworth
et al. (2015) to our BowTie and GS datasets, and an assembly- and alignment-free (AAF)
method (Fan et al., 2015) to the total set of reads output by the comparative repeat analysis
of the BowTie dataset. Dodsworth et al.’s approach rests on the assumption that, as
repeat abundances change primarily through random genetic drift (Jurka et al., 2011), they can
be used as selection-free characters for phylogenetic reconstruction. On the other hand, AAF
methods have been shown to identify potentially useful markers for taxonomic resolution from
genome skimming datasets (Bohmann et al., 2020). Both analyses were successful in
reconstructing the major relationships between the five Rhynchospora species with maximal
bootstrap support. Our results not only corroborate that AAF methods can be useful for
repetitive sequence data, but also show that they can be employed on the off-target portion of TCS
data. These sequence similarity-based approaches (e.g. Vitales et al. 2019) may be more
appropriate for groups where no genome skimming data are available for verifying the accuracy
of repeat abundance in the target dataset, when abundance-based approaches provide less
resolved trees or when genome sizes are unknown.
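The idea behind an assembly- and alignment-free comparison can be sketched with k-mer sets: two read sets that share a larger fraction of their k-mers are treated as more closely related. The example below is a simplified illustration, not the AAF implementation used in this study. The distance is a log-transformed shared k-mer fraction divided by k; the exact formula, the use of the smaller k-mer set as denominator, and the function names are assumptions made for this sketch, and details such as reverse complements and low-frequency k-mer filtering are ignored.

```python
from math import log


def kmers(reads, k):
    """Return the set of k-mers (A/C/G/T only) found in a list of read strings."""
    found = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                found.add(kmer)
    return found


def aaf_like_distance(reads_a, reads_b, k):
    """Distance from the fraction of shared k-mers (assumed form, see lead-in)."""
    a, b = kmers(reads_a, k), kmers(reads_b, k)
    total = min(len(a), len(b))
    if total == 0:
        raise ValueError("at least one read set has no valid k-mers")
    shared = len(a & b)
    if shared == 0:
        return float("inf")  # no shared k-mers: distance is undefined/infinite
    return -log(shared / total) / k


# Hypothetical toy reads; real analyses use millions of reads and larger k.
d_example = aaf_like_distance(["ACGTACGT"], ["ACGTTTTT"], k=4)
print(round(d_example, 4))
```

Because only k-mer presence is compared, an approach of this kind does not depend on abundance estimates being accurate, which is why it may suit groups without genome skimming data for verification.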
The phylogenetic relationships retrieved by the repeat abundance and AAF-based
phylogeny were not only congruent with a Bayesian analysis of the 256 target regions, but also
with recent studies in the genus based on other phylogenetic data (Buddenhagen et al., 2016;
Ribeiro et al., 2018). Our results show the potential of repeats from off-target reads to be used
as an additional phylogenetically informative dataset, from a completely different part of the
genome than that typically used for phylogenetic studies. Target capture-based sequencing already
offers the opportunity to construct robust phylogenies with hundreds of informative markers
and also to assemble whole plastomes and other organellar DNAs via the off-target reads to
infer phylogenetic relationships and nuclear-organellar discordance (Dodsworth et al., 2019).
Repeat-based phylogenies offer an additional strategy, based on the same TCS datasets,
potentially uncovering nuclear intragenomic (in)congruence, while further increasing the
usefulness of TCS datasets. The robustness of the repetitive DNA information obtained from
our target datasets can prove useful in a variety of phylogenomic approaches, such as
similarity-based repeat phylogenies (Vitales et al., 2019) as well as the alignment-free and
abundance-based methods presented here. The ability to characterise genome composition and
develop cytogenetic markers from TCS-derived repeat data adds further value to these datasets,
and insight into the genomic processes underpinning evolution in plants.

ACKNOWLEDGMENTS
This study was supported in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível
Superior–Brasil (CAPES, Finance Code 001), CAPES-PRINT project number
88887.363884/2019-00 (LC), and CNPq (Conselho Nacional de Desenvolvimento Científico e
Tecnológico; grant number 141037/2018-0 to LC). The authors are also grateful to Dr.
Magdalena Vayo (Facultad de Agronomía, Uruguay) for providing comments and suggestions
for the manuscript and to MSc. Erton Almeida for the collection of Rhynchospora cephalotes.

AUTHOR CONTRIBUTIONS
LC and APH conceived the ideas and designed the study with important input from AM, GS,
AH and SD; LC, AM, CB, BH, WT and VS collected data; LC, AM and GS performed data
analysis; LC wrote the manuscript with comments from all the authors.

DATA AVAILABILITY
The data that support the findings of this study are openly available in NCBI GenBank at
www.ncbi.nlm.nih.gov under BioProjects PRJEB9643, PRJNA672922 and PRJNA672127.

REFERENCES
Abadi S, Azouri D, Pupko T, Mayrose I. 2019. Model selection may not be a mandatory
step for phylogeny reconstruction. Nature Communications 10: 934.
Albert TJ, Molla MN, Muzny DM, Nazareth L, Wheeler D, Song X, Richmond TA,
Middle CM, Rodesch MJ, Packard CJ, et al. 2007. Direct selection of human genomic loci
by microarray hybridization. Nature Methods 4: 903–905.
Aliyeva-Schnorr L, Beier S, Karafiátová M, Schmutzer T, Scholz U, Doležel J, Stein N,
Houben A. 2015. Cytogenetic mapping with centromeric bacterial artificial chromosomes
contigs shows that this recombination-poor region comprises more than half of barley
chromosome 3H. The Plant Journal 84: 385–394.
Ávila Robledillo L, Koblížková A, Novák P, Böttinger K, Vrbová I, Neumann P,
Schubert I, Macas J. 2018. Satellite DNA in Vicia faba is characterized by remarkable
diversity in its sequence composition, association with centromeres, and replication timing.
Scientific Reports 8.
Bilinski P, Albert PS, Berg JJ, Birchler J, Grote M, Lorant A, Quezada J, Swarts K,
Yang J, Ross-Ibarra J. 2017. Parallel Altitudinal Clines Reveal Adaptive Evolution Of
Genome Size In Zea mays.
Bohmann K, Mirarab S, Bafna V, Gilbert MTP. 2020. Beyond DNA barcoding: The
unrealized potential of genome skim data in sample identification. Molecular Ecology 29:
2521–2534.
Bolsheva NL, Melnikova NV, Kirov IV, Dmitriev AA, Krasnov GS, Amosova AV,
Samatadze TE, Yurkevich OYu, Zoshchuk SA, Kudryavtseva AV, et al. 2019.
Characterization of repeated DNA sequences in genomes of blue-flowered flax. BMC
Evolutionary Biology 19: 49.
Buddenhagen CE. 2016. A view of Rhynchosporeae (Cyperaceae) diversification before and
after the application of anchored phylogenomics across the angiosperms.
Buddenhagen C, Lemmon AR, Lemmon EM, Bruhl J, Cappa J, Clement WL,
Donoghue MJ, Edwards EJ, Hipp AL, Kortyna M, et al. 2016. Anchored Phylogenomics
of Angiosperms I: Assessing the Robustness of Phylogenetic Estimates. Evolutionary Biology.
Bureš P, Zedek F, Markova M. 2013. Holocentric Chromosomes. In: Plant Genome Diversity Volume 2. Vienna: Springer Vienna, 187–204.
Ceccarelli M, Sarri V, Caceres ME, Cionini PG. 2011. Intraspecific genotypic diversity in plants (P Donini, Ed.). Genome 54: 701–709.
Cheng Z, Dong F, Langdon T, Ouyang S, Buell CR, Gu M, Blattner FR, Jiang J. 2002. Functional Rice Centromeres Are Marked by a Satellite Repeat and a Centromere-Specific Retrotransposon. The Plant Cell 14: 1691–1704.
Čížková J, Hřibová E, Humplíková L, Christelová P, Suchánková P, Doležel J. 2013. Molecular Analysis and Genomic Organization of Major DNA Satellites in Banana (Musa spp.) (K Kashkush, Ed.). PLoS ONE 8: e54808.
Cosart T, Beja-Pereira A, Chen S, Ng SB, Shendure J, Luikart G. 2011. Exome-wide DNA capture and next generation sequencing in domestic and wild species. BMC Genomics 12: 347.
Dodsworth S. 2015. Genome skimming for next-generation biodiversity analysis. Trends in Plant Science 20: 525–527.
Dodsworth S, Chase MW, Kelly LJ, Leitch IJ, Macas J, Novak P, Piednoel M, Weiss-Schneeweiss H, Leitch AR. 2015. Genomic Repeat Abundances Contain Phylogenetic Signal. Systematic Biology 64: 112–126.
Dodsworth S, Jang T-S, Struebig M, Chase MW, Weiss-Schneeweiss H, Leitch AR. 2017. Genome-wide repeat dynamics reflect phylogenetic distance in closely related allotetraploid Nicotiana (Solanaceae). Plant Systematics and Evolution 303: 1013–1020.
Dodsworth S, Pokorny L, Johnson MG, Kim JT, Maurin O, Wickett NJ, Forest F, Baker WJ. 2019. Hyb-Seq for Flowering Plant Systematics. Trends in Plant Science 24: 887–891.
Dolezel J, Sgorbati S, Lucretti S. 1992. Comparison of three DNA fluorochromes for flow cytometric estimation of nuclear DNA content in plants. Physiologia Plantarum 85: 625–631.
Drummond AJ, Rambaut A. 2007. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology 7: 214.
Eaton DAR, Spriggs EL, Park B, Donoghue MJ. 2016. Misconceptions on Missing Data in RAD-seq Phylogenetics with a Deep-scale Example from Flowering Plants. Systematic Biology: syw092.
Elliott TA, Gregory TR. 2015. What’s in a genome? The C-value enigma and the evolution of eukaryotic genome content. Philosophical Transactions of the Royal Society B: Biological Sciences 370: 20140331.
Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 2012. Ultraconserved Elements Anchor Thousands of Genetic Markers Spanning Multiple Evolutionary Timescales. Systematic Biology 61: 717–726.
Fan H, Ives AR, Surget-Groba Y, Cannon CH. 2015. An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics 16: 522.
Galindo-González L, Mhiri C, Deyholos MK, Grandbastien M-A. 2017. LTR-retrotransposons in plants: Engines of evolution. Gene 626: 14–25.
Gnirke A, Melnikov A, Maguire J, Rogov P, LeProust EM, Brockman W, Fennell T, Giannoukos G, Fisher S, Russ C, et al. 2009. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology 27: 182–189.
Guignard MS, Nichols RA, Knell RJ, Macdonald A, Romila C-A, Trimmer M, Leitch IJ, Leitch AR. 2016. Genome size and ploidy influence angiosperm species’ biomass under nitrogen and phosphorus limitation. New Phytologist 210: 1195–1206.
Heyduk K, Trapnell DW, Barrett CF, Leebens-Mack J. 2016. Phylogenomic analyses of species relationships in the genus Sabal (Arecaceae) using targeted sequence capture. Biological Journal of the Linnean Society 117: 106–120.
Houben A, Schubert I. 2003. DNA and proteins of plant centromeres. Current Opinion in Plant Biology 6: 554–560.
Ilves KL, López-Fernández H. 2014. A targeted next-generation sequencing toolkit for exon-based cichlid phylogenomics. Molecular Ecology Resources 14: 802–811.
Ishii T, Sunamura N, Matsumoto A, Eltayeb AE, Tsujimoto H. 2015. Preferential recruitment of the maternal centromere-specific histone H3 (CENH3) in oat (Avena sativa L.) × pearl millet (Pennisetum glaucum L.) hybrid embryos. Chromosome Research 23: 709–718.
Jasencakova Z, Meister A, Schubert I. 2001. Chromatin organization and its relation to replication and histone acetylation during the cell cycle in barley. Chromosoma 110: 83–92.
Johnson MG, Pokorny L, Dodsworth S, Botigué LR, Cowan RS, Devault A, Eiserhardt WL, Epitawalage N, Forest F, Kim JT, et al. 2019. A Universal Probe Set for Targeted Sequencing of 353 Nuclear Genes from Any Flowering Plant Designed Using k-Medoids Clustering (S Renner, Ed.). Systematic Biology 68: 594–606.
Jurka J, Bao W, Kojima KK. 2011. Families of transposable elements, population structure and the origin of species. Biology Direct 6: 44.
Kearse M, Moir R, Wilson A, Stones-Havas S, Cheung M, Sturrock S, Buxton S, Cooper A, Markowitz S, Duran C, et al. 2012. Geneious Basic: An integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28: 1647–1649.
Koo D-H, Hong CP, Batley J, Chung YS, Edwards D, Bang J-W, Hur Y, Lim YP. 2011. Rapid divergence of repetitive DNAs in Brassica relatives. Genomics 97: 173–185.
Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 9: 357–359.
Lemmon AR, Emme SA, Lemmon EM. 2012. Anchored Hybrid Enrichment for Massively High-Throughput Phylogenomics. Systematic Biology 61: 727–744.
Loureiro J, Rodriguez E, Dolezel J, Santos C. 2007. Two new nuclear isolation buffers for plant DNA flow cytometry: a test with 37 species. Annals of Botany 100: 875–888.
Lyu H, He Z, Wu C-I, Shi S. 2018. Convergent adaptive evolution in marginal environments: unloading transposable elements as a common strategy among mangrove genomes. New Phytologist 217: 428–438.
Macas J, Novák P, Pellicer J, Čížková J, Koblížková A, Neumann P, Fuková I, Doležel J, Kelly LJ, Leitch IJ. 2015. In Depth Characterization of Repetitive DNA in 23 Plant Genomes Reveals Sources of Genome Size Variation in the Legume Tribe Fabeae (A Houben, Ed.). PLOS ONE 10: e0143424.
Mandel JR, Dikow RB, Funk VA, Masalia RR, Staton SE, Kozik A, Michelmore RW, Rieseberg LH, Burke JM. 2014. A Target Enrichment Method for Gathering Phylogenetic Information from Hundreds of Loci: An Example from the Compositae. Applications in Plant Sciences 2: 1300085.
Marques A, Ribeiro T, Neumann P, Macas J, Novák P, Schubert V, Pellino M, Fuchs J, Ma W, Kuhlmann M, et al. 2015. Holocentromeres in Rhynchospora are associated with genome-wide centromere-specific repeat arrays interspersed among euchromatin. Proceedings of the National Academy of Sciences 112: 13633–13638.
Martín-Peciña M, Ruiz-Ruano FJ, Camacho JPM, Dodsworth S. 2019. Phylogenetic signal of genomic repeat abundances can be distorted by random homoplasy: a case study from hominid primates. Zoological Journal of the Linnean Society 185: 543–554.
Mascagni F, Vangelisti A, Giordani T, Cavallini A, Natali L. 2020. A computational comparative study of the repetitive DNA in the genus Quercus L. Tree Genetics & Genomes 16: 11.
Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, Jiang J. 2003. Chromatin immunoprecipitation reveals that the 180-bp satellite repeat is the key functional DNA element of Arabidopsis thaliana centromeres. Genetics 163: 1221–1225.
Neumann P, Novák P, Hoštáková N, Macas J. 2019. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mobile DNA 10.
Novák P, Ávila Robledillo L, Koblížková A, Vrbová I, Neumann P, Macas J. 2017. TAREAN: a computational tool for identification and characterization of satellite DNA from unassembled short reads. Nucleic Acids Research 45: e111–e111.
Novak P, Neumann P, Pech J, Steinhaisl J, Macas J. 2013. RepeatExplorer: a Galaxy-based web server for genome-wide characterization of eukaryotic repetitive elements from next-generation sequence reads. Bioinformatics 29: 792–793.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. 2018. Posterior Summarization in Bayesian Phylogenetics Using Tracer 1.7 (E Susko, Ed.). Systematic Biology 67: 901–904.
Ribeiro T, Buddenhagen CE, Thomas WW, Souza G, Pedrosa-Harand A. 2018. Are holocentrics doomed to change? Limited chromosome number variation in Rhynchospora Vahl (Cyperaceae). Protoplasma 255: 263–272.
Ribeiro T, Marques A, Novák P, Schubert V, Vanzela ALL, Macas J, Houben A, Pedrosa-Harand A. 2017. Centromeric and non-centromeric satellite DNA organisation differs in holocentric Rhynchospora species. Chromosoma 126: 325–335.
Ribeiro T, Vasconcelos E, dos Santos KGB, Vaio M, Brasileiro-Vidal AC, Pedrosa-Harand A. 2020. Diversity of repetitive sequences within compact genomes of Phaseolus L. beans and allied genera Cajanus L. and Vigna Savi. Chromosome Research 28: 139–153.
Ruiz-Ruano FJ, López-León MD, Cabrero J, Camacho JPM. 2016. High-throughput analysis of the satellitome illuminates satellite DNA evolution. Scientific Reports 6.
Sarmashghi S, Bohmann K, P. Gilbert MT, Bafna V, Mirarab S. 2019. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biology 20: 34.
Sass C, Iles WJD, Barrett CF, Smith SY, Specht CD. 2016. Revisiting the Zingiberales: using multiplexed exon capture to resolve ancient and recent phylogenetic splits in a charismatic plant lineage. PeerJ 4: e1584.
Schmickl R, Liston A, Zeisek V, Oberlander K, Weitemier K, Straub SCK, Cronn RC, Dreyer LL, Suda J. 2016. Phylogenetic marker development for target enrichment from transcriptome and genome skim data: the pipeline and its application in southern African Oxalis (Oxalidaceae). Molecular Ecology Resources 16: 1124–1135.
Schrader L, Schmitz J. 2019. The impact of transposable elements in adaptive evolution. Molecular Ecology 28: 1537–1549.
Sonnhammer ELL, Durbin R. 1995. A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167: GC1–GC10.
Souza G, Costa L, Guignard MS, Van-Lume B, Pellicer J, Gagnon E, Leitch IJ, Lewis GP. 2019. Do tropical plants have smaller genomes? Correlation between genome size and climatic variables in the Caesalpinia Group (Caesalpinioideae, Leguminosae). Perspectives in Plant Ecology, Evolution and Systematics 38: 13–23.
Straub SCK, Parks M, Weitemier K, Fishbein M, Cronn RC, Liston A. 2012. Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics. American Journal of Botany 99: 349–364.
Thomas WmW, Araújo AC, Alves MV. 2009. A Preliminary Molecular Phylogeny of the Rhynchosporeae (Cyperaceae). The Botanical Review 75: 22–29.
Van-Lume B, Esposito T, Diniz-Filho JAF, Gagnon E, Lewis GP, Souza G. 2017. Heterochromatic and cytomolecular diversification in the Caesalpinia group (Leguminosae): Relationships between phylogenetic and cytogeographical data. Perspectives in Plant Ecology, Evolution and Systematics 29: 51–63.
Vitales D, Garcia S, Dodsworth S. 2019. Reconstructing Phylogenetic Relationships Based on Repeat Sequence Similarities. bioRxiv.
Wang H-J, Li W-T, Liu Y-N, Yang F-S, Wang X-Q. 2017. Resolving interspecific relationships within evolutionarily young lineages using RNA-seq data: An example from Pedicularis section Cyathophora (Orobanchaceae). Molecular Phylogenetics and Evolution 107: 345–355.
Weisshart K, Fuchs J, Schubert V. 2016. Structured Illumination Microscopy (SIM) and Photoactivated Localization Microscopy (PALM) to Analyze the Abundance and Distribution of RNA Polymerase II Molecules on Flow-sorted Arabidopsis Nuclei. BIO-PROTOCOL 6.
Weiss-Schneeweiss H, Leitch AR, McCann J, Macas J. 2015. Exploring the repeats’ landscape and its impact on genome evolution and plant diversification. In: Germany: Koeltz Scientific Books.
Weitemier K, Straub SCK, Cronn RC, Fishbein M, Schmickl R, McDonnell A, Liston A. 2014. Hyb-Seq: Combining Target Enrichment and Genome Skimming for Plant Phylogenomics. Applications in Plant Sciences 2: 1400042.
Wickham H. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Supporting Information
Table S1 – Detailed annotation of repetitive elements at lineage level for all datasets on all Rhynchospora species.
Table S2 – Name and genomic abundance of the satellites found in all datasets of the five Rhynchospora species.
Figure S1 – Barplots of genomic abundance of unclassified repeats in every dataset of all analysed Rhynchospora species.
Figure S2 – Dotplot comparison of satellites found in all datasets of all analysed Rhynchospora species.